A PRACTICAL GUIDE TO RELIABILITY


Assorted military handbooks and other detailed references abound for
the specialist in the field of reliability. However, most of the references
are too detailed and too mathematical to be useful as an introduction to
reliability for students of project management and for others who seek
only a familiarity with basic principles. For these individuals, there is
a need for a brief discussion of what reliability is and how it is achieved
during the systems acquisition process. This booklet is an attempt to
fill that need.

INTRODUCTION


Reliability  as  a formal  discipline  in  design  and  production  is  relatively 
new; yet the concept is old. The designers of the first steam ships were
concerned about the ability of the boilers and engines to withstand the long
transocean crossing; therefore, they provided redundancy in the form of sails.
Long after electric starters became standard equipment on American automobiles,
hand cranks were still provided to ensure reliable starting. In the past,
designers were concerned with the same questions which are raised today in
connection with reliability: Will the device work when it is needed? Will it
work long enough to perform its intended function? What are the costs (both
monetary  and  opportunity)  associated  with  a failure?  The  concern  today  is 
even  greater,  because  the  consequences  of  unreliable  weapons  and  equipment 
are  graver--in  terms  of  cost,  in  terms  of  safety,  and  in  terms  of  accomplishing 
the  mission. 

SYSTEM  EFFECTIVENESS 


A discussion of reliability should begin by relating it to the overall
measure of a system's utility: system effectiveness. The effectiveness of a
system can be viewed as a combination of three factors: availability,
dependability, and capability (figure 1).

Availability              - Is the system ready to operate when called on?

Dependability             - Will the system continue to operate properly
                            for the required duration of the mission?

Capability (Performance)  - If the system performs as designed, is it
                            capable of accomplishing the mission?


Figure 1 is an oversimplification of system effectiveness because it
omits a host of other factors which affect availability, dependability, and
capability. However, the figure emphasizes three of the most important of
these factors: reliability, maintainability, and logistical support. In fact,
these three factors are so vital and so interrelated to availability and
dependability (which is really synonymous with reliability) that they are
usually taught and discussed together under the heading of RAM (reliability,
availability, maintainability). This booklet focuses only on reliability,
partly for simplicity of presentation and partly because of the
compelling logic that improvements in reliability ought to significantly decrease
the need for maintenance and its associated logistic support.


DEFINITION  OF  RELIABILITY 


Reliability  is  a quantitative  concept.  It  is  the  probability  that  if  an 
item  is  put  to  use  under  specified  operating  conditions,  it  will  perform  its 
intended  function  for  a specified  interval.  (The  interval  can  be  time,  miles, 
cycles,  rounds,  etc.) 

How  is  reliability  computed?  To  answer  that  question,  let  us  consider  the 
meaning of "probability." If an experiment is performed under identical
conditions N times, and a particular result occurs A times, the probability of
A's occurrence, P(A), is defined as the limit of the ratio A/N as N becomes
infinite.

$$ P(A) = \lim_{N \to \infty} \frac{A}{N} $$


In  practice  we  perform  the  experiment  some  reasonably  large  number  of  times 
and  use  the  resulting  ratio,  A/N,  as  an  estimate  of  the  true  probability  to 
predict the outcome in the future. To see how probability relates to
reliability, we will look at two examples. The first is an artillery piece example
for which the specified reliability interval is consecutive rounds fired.
The second example is electronic components, for which the specified interval
is operating time.


Example #1. Artillery Howitzer. Consider the development of a new type of
artillery howitzer. We would like to estimate the probability that the
howitzer will fire a round in 125°F weather without misfiring or jamming. If we
test 10,000 rounds and observe only 10 misfires (failures), we could estimate
that the probability of firing any single round successfully is:

$$ P(\text{success, 1 round}) = \frac{9990}{10{,}000} = .999 $$

Now, an artillery officer might ask the question: what is the probability
that the howitzer can fire 30 consecutive rounds during a mission without any
failures? According to the laws of probability, the probability of any number
of independent events occurring consecutively is equal to the product of the
probabilities of occurrence for each single event. Thus, the probability that
all of the 30 consecutive rounds will fire successfully is:

$$ (.999)(.999)(.999) \cdots (.999) = (.999)^{30} = .97 $$


After this calculation, we have now specified all of the elements of
reliability: a probability (.97), an operating condition (125°F), a function (firing),
and an interval (30 rounds). Under these conditions, the reliability of
completing the mission is .97. If we repeat the probability calculation for various
numbers of rounds and plot the results (figure 2), we can show how reliability
varies for missions from 30 to 800 rounds in length.
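
The repeated calculation described above can be sketched in a few lines of
Python. This is only an illustration: the function names are ours, and the
sketch assumes, as the text does, that every round is an independent trial
with the same probability of success.

```python
def single_round_reliability(rounds_fired: int, failures: int) -> float:
    """Estimate the probability that one round fires successfully (the ratio A/N)."""
    return (rounds_fired - failures) / rounds_fired


def mission_reliability(p_round: float, rounds: int) -> float:
    """Probability that `rounds` consecutive, independent rounds all fire."""
    return p_round ** rounds


# Test data from Example #1: 10,000 rounds fired, 10 failures.
p = single_round_reliability(10_000, 10)

# Mission lengths chosen by us to span the 30 to 800 round range of figure 2.
for n in (30, 100, 400, 800):
    print(f"{n:4d} rounds: R = {mission_reliability(p, n):.3f}")
```

Plotting the printed values against mission length would give a curve of the
kind the text attributes to figure 2.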

Another  useful  way  to  express  the  reliability  of  this  howitzer  is  by  its 
mean  rounds  between  failure  (MRBF).  For  this  example,  the  mean  rounds  between 
failure  is  computed  as: 


$$ \text{MRBF} = \frac{\text{Total Rounds Fired}}{\text{Total Number of Failures}} = \frac{10{,}000}{10} = 1000 \text{ rounds} $$


In  the  first  example,  time  was  not  a factor.  There  was  little  difference 
whether  the  mission  was  accomplished  in  two  hours  or  ten  hours  (assuming  the 
operating  conditions  were  unchanged,  of  course).  However,  for  most  items  of 
hardware,  mission  success  is  related  to  time  or  some  time-dependent  variable 
such  as  miles  or  cycles.  The  next  example  illustrates  this  for  the  case  of 
an  electronic  component. 

Example #2. Electrical Resistor. Consider a particular type of
electrical resistor. We would like to know the probability that this type of
resistor will be able to operate continuously at 50°C for 50,000 hours
(about 6 years) without failing. We could estimate this probability by
applying our earlier definition of probability, A/N, i.e., performing an
experiment N times and observing the number of times (A) the resistor
specimen was still in operation after 50,000 hours. There is no reason why we
could not conduct all of the experiments concurrently, if we ensured that
each resistor operated independently. Starting with 1000 perfect¹ resistors
(N), we might expect the results to look like figure 3a. As time passes,
resistors begin to fail one by one, the failures occurring randomly over
time. The resistor failures are caused by a complex set of internal
physical and chemical changes which result from applied stresses and the
effects of time. After 50,000 hours, there are 607 resistors still
operating (A). Therefore A/N = 607/1000 = .607 is our experimental estimate of
the probability that any resistor of this type can operate for at least
50,000 hours.

By repeating the calculation for earlier values of time and corresponding
numbers of still operational resistors, we can estimate the probability that
for any given length of time, t, a resistor of this type would operate
without failing. These probabilities are represented by the curve in figure 3b.
Once again all of the elements of the definition of reliability are present:
probability, specified interval (time), function (operate), and conditions
(50°C). Therefore, the curve in figure 3b is also a reliability curve for
this type of resistor.


¹By "perfect" resistors, we mean that there are no defective units
or  partially  defective  units  which  might  have  failed  early  in  the  test. 
Also,  we  must  assume  that  this  type  of  resistor  does  not  significantly 
deteriorate  or  wear  out  within  50,000  hours.  Similarly,  example  #1 
assumes  that  even  the  longest  mission  length  (800  rds  in  figure  2)  does 
not  exceed  the  wearout  life  of  the  howitzer. 

Failure Rate


Next, we need to discuss a key reliability parameter: failure rate.
Failure rate is a measure of the number of failures experienced per unit of
time, i.e., failures per hour or failures per 1000 hours, etc. When a
number of units are being tested, the failure rate is computed by dividing the
number of failures during some small time interval, t, by the average number
of units under test during t, and then dividing again by t.

$$ \text{Failure rate} = \frac{(\text{No. of failures in } t) \,/\, (\text{Average no. of units under test in } t)}{t} $$

Defined this way, failure rate is a relative rate, i.e., its dimensions are
failures per unit under test per increment of time. If we looked at the
detailed records for our resistor experiment, we could develop the matrix
in figure 4a.
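
The failure rate computation can be sketched as follows. The test-log rows
are invented by us for illustration and are not the entries of figure 4a; the
function name is also ours, and the average number of units under test is
assumed to be the mean of the counts at the start and end of the interval.

```python
def failure_rate(failures: int, units_at_start: int, interval_hours: float) -> float:
    """Failures per unit under test per hour over one small interval."""
    units_at_end = units_at_start - failures
    # Assumption: average units under test = mean of start and end counts.
    average_units = (units_at_start + units_at_end) / 2
    return failures / average_units / interval_hours


# Hypothetical rows: (start hour, units operating at start, failures in next 1000 h)
records = [(0, 1000, 10), (10_000, 905, 9), (40_000, 670, 7)]

for start, units, failed in records:
    rate = failure_rate(failed, units, 1000.0)
    print(f"{start:6d} h: {rate:.2e} failures per unit-hour")
```

With these made-up rows, each interval gives a rate close to .00001 failures
per unit per hour, which is the sort of near-constant result the text goes on
to describe.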

Note that for three separate time intervals, the computed failure rate
was approximately constant. In fact, if we picked a number of time
intervals from the hypothetical records, the computed failure rate would remain
approximately constant, as shown in figure 4b. When this constant failure
rate occurs in nature, it leads to a mathematical expression for reliability
called an exponential function. For a constant failure rate, λ, the
reliability, R, for any mission time, t, is given by the function $R = e^{-\lambda t}$.

The curve in figure 3b represents this exponential reliability function for
our resistor's constant failure rate of λ = .00001.¹

Mean  Time  Between  Failures  (MTBF) 

Another much used reliability parameter is the mean time between failures
(MTBF). For items which have an exponential reliability function, i.e.,
constant failure rate, MTBF is the reciprocal of failure rate. For our resistor
example, the MTBF is:

$$ \text{MTBF} = \frac{1}{\lambda} = \frac{1}{.00001} = 100{,}000 \text{ hours per failure of a particular unit} $$

When referring to the reliability of a system or a piece of equipment,
MTBF is useful because it relates readily to mission length. For example,
consider a system which has a typical mission length of 10 hours and a
tentative reliability requirement of .9. We would like to know (1) how large the
MTBF for this system should be, and (2) how sensitive mission reliability is
to variations in MTBF. If our piece of equipment has an exponential
reliability function, then we know that:

$$ \text{Reliability} = .9 = e^{-t/\text{MTBF}} = e^{-10 \text{ hrs}/\text{MTBF}} $$

Solving this equation for MTBF gives us:

$$ \text{MTBF} = \frac{-10}{\ln .9} = 94.9 \text{ hrs} $$


We could also have found the answer graphically by referring to the curve of
$R = e^{-t/\text{MTBF}}$ shown in figure 4c. The curve also indicates that to improve
system reliability much above .9 requires a large improvement in MTBF; this
might not be worth the cost and effort.
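
The same answer, and the sensitivity the curve illustrates, can be checked
numerically. The sketch below assumes the exponential reliability function
discussed above; the function names and the extra target values (.95 and .99)
are ours.

```python
import math


def reliability(mission_hours: float, mtbf_hours: float) -> float:
    """Exponential reliability R = exp(-t / MTBF) for a constant failure rate."""
    return math.exp(-mission_hours / mtbf_hours)


def required_mtbf(mission_hours: float, target_reliability: float) -> float:
    """Invert R = exp(-t / MTBF) to find the MTBF needed for a target R."""
    return -mission_hours / math.log(target_reliability)


# Resistor example: MTBF of 100,000 hours, 50,000 hour interval.
print(f"Resistor, 50,000 h: R = {reliability(50_000, 100_000):.3f}")

# 10 hour mission: MTBF needed for several reliability targets.
for target in (0.90, 0.95, 0.99):
    print(f"R = {target:.2f} over 10 h needs MTBF of {required_mtbf(10, target):.1f} h")
```

Comparing the MTBF needed for the higher targets with the one needed for .9
shows how quickly the requirement grows.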


¹In practice, today's resistors have failure rates which are 100 to 1000
times  better  than  the  figure  used  in  our  example. 


The  "Bathtub"  Curve 


In developing the discussion of the hypothetical resistor experiment,
we stressed that the 1000 test resistors were "perfect", i.e., free from
defects which would cause early failures. In reality, this is never the
case. Due to the variability in manufacturing processes and the fallibility
of quality control inspections, any population of components will contain
some defective or weak units. If the defect is serious enough to render the
component inoperable initially (zero time defects), it would naturally be
eliminated before it is put to use. However, many latent defects are not
obvious until after power is applied and heat is generated. These "latent"
defects contribute to a relatively high failure rate during the early stages
in the life of a component population.


If we were to use a real population of components in our resistor
experiment, the actual curve representing the variation of failure rate with time
would look like figure 5. During the first several hundred hours, the failure
rate would be relatively high as the defective or weak components failed one
by one. This period is referred to as the infant mortality or burn-in period.
After the weak components are weeded out, the population failure rate settles
down to a nearly constant level sometimes referred to as the "base" failure
rate. This period is called the useful life period, because it is here that
components are used to their greatest advantage. Had we continued our
experiment beyond 50,000 hours, we would have reached the third typical
period in the life of components, the old age or wearout period. During this
period the failure rate climbs as components begin to deteriorate rapidly.


Limitations  of  the  Exponential  Reliability  Function 

Not all items exhibit failure rates which are constant over some
portion of their life. Electrical components and some other parts usually do;
and the exponential reliability function which results is very convenient to
handle mathematically. But many items exhibit failure rates which increase
or decrease with time because of some physical process such as gradual
wearing, corrosion, or work hardening. When the failure rate is not
approximately constant, the exponential expression for reliability is inapplicable.
In such cases other mathematical functions such as the Weibull, the Normal,
the Log-Normal, and the Extreme Value must be used. Most reliability texts
contain detailed discussions of these reliability laws.
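
As one illustration of a non-constant failure rate, the sketch below uses the
standard two-parameter Weibull law, whose failure rate falls with time when
the shape parameter is below 1, is constant when it equals 1 (the exponential
case), and rises when it exceeds 1. The parameter values and function names
are hypothetical and are not taken from this booklet.

```python
import math


def weibull_failure_rate(t_hours: float, shape: float, scale_hours: float) -> float:
    """Instantaneous failure rate of a Weibull law at time t (requires t > 0)."""
    return (shape / scale_hours) * (t_hours / scale_hours) ** (shape - 1.0)


def weibull_reliability(t_hours: float, shape: float, scale_hours: float) -> float:
    """Probability of surviving to time t under a Weibull law."""
    return math.exp(-((t_hours / scale_hours) ** shape))


SCALE = 100_000.0  # hypothetical characteristic life, in hours

# Shape below 1: decreasing rate; equal to 1: constant; above 1: increasing.
for shape in (0.5, 1.0, 3.0):
    rates = [weibull_failure_rate(t, shape, SCALE) for t in (1_000.0, 10_000.0, 50_000.0)]
    print(f"shape {shape}: " + ", ".join(f"{r:.2e}" for r in rates))
```

With a shape of 1 and this scale, the sketch reduces to the resistor example:
a constant failure rate of .00001 per hour.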


As  the  operating  time  for  a piece  of  equipment  approaches  the  wearout 
time  of  one  of  its  components,  the  component  or  part  must  be  replaced  during 
planned  maintenance  in  order  to  avoid  subsequent  failures  at  inopportune 
times. 

Equipment Reliability


Item No.   Description   Failure Rate (failures per hour)

   1       Resistor        1.0 x 10^-5
   2       Capacitor       5.0 x 10^-5
   3       Resistor        1.0 x 10^-5
   4       Diode           1.5 x 10^-5
   5       Diode           1.5 x 10^-5
           Total          10.0 x 10^-5

Fig. 6a


[Fig. 6b]


Hardware  Reliability  Prediction 


Assume we build a small item of hardware using one of our resistors
and four other electrical components. Assume we have tested each type of
component to determine its failure rate and the results are in figure 6a.
Further, assume we have connected the components in a series fashion such
that failure of any one of them will cause a failure of our piece of
hardware. Given a mission length, t, how do we calculate the reliability of
the hardware item?

Since each component makes a contribution to the overall failure rate
of the piece of hardware, we can simply add the individual failure rates to
give a combined hardware failure rate.¹

$$ \lambda_h = \lambda_1 + \lambda_2 + \lambda_3 + \lambda_4 + \lambda_5 $$

$$ \lambda_h = 1.0 \times 10^{-5} + 5.0 \times 10^{-5} + 1.0 \times 10^{-5} + 1.5 \times 10^{-5} + 1.5 \times 10^{-5} $$

$$ \lambda_h = 10.0 \times 10^{-5} $$

Now, using the hardware failure rate, we can compute the hardware reliability
from $R = e^{-\lambda_h t}$. This is plotted in figure 7 for any t.

Note how the reliability of the combination (figure 7) has been
degraded compared to the reliability of our single resistor (figure 3b). The
culprits were: (1) the fact that we had to use more components, all of which
contributed to the unreliability of the system and (2) Item 2, which had a
failure rate significantly higher than the other components. Imagine adding
up the failure rates of the thousands of series components contained in
some of our military systems. It is plain to see why two primary objectives
of any reliability program are: (1) minimize the number of parts, and (2)
choose the most reliable parts available within the constraints of cost,
schedule, and space.
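
The series calculation can be reproduced with a short sketch. It uses the
five failure rates listed in figure 6a and assumes constant failure rates and
a pure series arrangement, as in the text; the function names and the 10,000
hour mission length are ours.

```python
import math

# Failure rates in failures per hour, as listed in figure 6a.
FAILURE_RATES = [1.0e-5, 5.0e-5, 1.0e-5, 1.5e-5, 1.5e-5]


def series_failure_rate(rates):
    """Combined failure rate of components connected in series."""
    return sum(rates)


def series_reliability(rates, mission_hours):
    """Hardware reliability R = exp(-lambda_h * t) for the series combination."""
    return math.exp(-series_failure_rate(rates) * mission_hours)


def product_of_reliabilities(rates, mission_hours):
    """Same result obtained by multiplying the individual reliabilities."""
    return math.prod(math.exp(-rate * mission_hours) for rate in rates)


t = 10_000  # hypothetical mission length, hours
print(f"combined failure rate: {series_failure_rate(FAILURE_RATES):.1e} per hour")
print(f"hardware R({t} h):       {series_reliability(FAILURE_RATES, t):.3f}")
print(f"single resistor R({t} h): {math.exp(-FAILURE_RATES[0] * t):.3f}")
```

The two functions `series_reliability` and `product_of_reliabilities` agree,
which is the equivalence between adding failure rates and multiplying
reliabilities for the exponential case.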


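¹In this example, adding failure rates is mathematically equivalent
to multiplying individual reliabilities because we have an exponential
expression for reliability:

$$ R(\text{hardware}) = (e^{-\lambda_1 t})(e^{-\lambda_2 t})(e^{-\lambda_3 t})(e^{-\lambda_4 t})(e^{-\lambda_5 t}) = e^{-(\lambda_1 + \lambda_2 + \lambda_3 + \lambda_4 + \lambda_5) t} $$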


[Figure labels: Mission Reliability; Mission Lengths]


A RELIABILITY  PROGRAM 


How are reliability requirements established? What steps does a
contractor take during design, development, and production to enhance
reliability? How does the government contract for reliable products and effectively
manage a reliability program? What are some of the major obstacles and
problems? These are the questions which the remainder of this booklet will
address.


ESTABLISHING  RELIABILITY  REQUIREMENTS 

Reliability begins with a realistic, achievable requirement. For
military hardware, the requirement is established jointly by the military user
and the military developer in the following manner.

The  first  step  is  to  evaluate  the  reliability  of  systems  currently  in 
the  field.  This  evaluation  indicates  the  status  of  current  reliability  levels 
and  the  trends  of  reliability  improvement. 


The second step is to conduct a thorough systems analysis involving
tradeoffs between reliability levels, mission performance, and logistical factors.
This analysis will indicate the reliability level which is actually needed
and appears affordable. Figure 8 shows, for a typical artillery piece, an
example of the sensitivity of reliability to mean rounds between failure
(MRBF) with various assumed mission lengths.¹

The third step is a technical assessment of the tentative requirement.
This considers the technical feasibility of attaining the desired reliability
goal, the schedule implications of striving for that goal, and such factors
as the ability to determine by testing whether or not the equipment has
reached its reliability goal.

The final result is a reliability requirement which is usually stated
with two values: a specified value, which is the value the developer will
use as a design requirement, and a minimum acceptable value, which represents
the least operational capability the user can tolerate.


¹Probably the most crucial part of setting reliability requirements is
developing a complete and accurate system definition. Figure 9 illustrates one
of the major difficulties: what is a typical mission? 86 rounds? 425 rounds?
or 50 rounds? Obviously, there can be a number of "typical" missions
depending on the situation. In the case of aviation systems, the definition becomes
even more difficult. Is an aircraft performing an intercept mission, a ground
support mission, or a reconnaissance mission? Is an air-to-air missile flying
for the first time or the tenth time? Clearly, different missions and
operating modes may require different reliability requirements.

A complete  definition  of  the  mission  must  also  include  the  anticipated 
environmental  conditions  in  which  the  item  may  operate  (levels  of  temperature, 
humidity,  vibration,  shock,  salt  spray,  altitude,  etc.)  and  the  length  of  time 
in  each. 

RELIABILITY IN THE DESIGN PHASE

The  reliability  of  a product  depends  primarily  on  its  design.  The  best 
manufacturing  techniques  and  the  most  thorough  testing  cannot  improve  an  item's 
reliability  beyond  that  which  is  inherent  in  its  design.  It  is  here  that  the 
designer  must  make  tradeoffs  with  performance  and  use  special  techniques  to 
enhance  reliability  (figure  9). 

The  following  list  summarizes  some  basic  techniques  which  are  used 
during  the  design  phase. 


1. Know the True Environmental Conditions

2. Keep the Design Simple

3. Develop an Accurate Model

4. Select Reliable Parts

5. Apply Parts Properly in the Design

6. Conduct Thorough Design Reviews

These techniques are not employed strictly in the order listed because the
process is very iterative. (Analysis discovers problems which require redesign
using different parts and so forth.) A brief discussion of each technique
follows.

Know  the  True  Environmental  Conditions 

Overall environmental conditions are well known even before the design
phase begins. However, the environmental conditions so defined are more
descriptive of the whole system than of its elements. The designer must
determine the appropriate levels of temperature, vibration, etc. for each location
within the system. Detailed environmental profiles may identify local extremes
which dictate relocation of sensitive items to a more environmentally benign
location. Figure 10 shows a typical profile for three elements of the
environment during a four-hour aircraft mission.

[Figure: Integrated Circuit With Finned Stud For Cooling]


Keep  the  Design  Simple 

The need for simplicity is as important to reliability as it is to so
many  other  aspects  of  our  complicated  lives.  Yet,  the  demands  of  complex 
mission  performance  requirements  and  the  natural  inventiveness  of  engineers 
can  act  as  powerful  forces  to  undermine  simplicity.  Field  records  show 
unmistakable  correlation  between  poor  reliability  and  unnecessarily  complex 
designs  utilizing  parts  which  do  not  have  a proven  track  record  of  reliable 
performance. 

A fundamental  goal  of  every  designer  should  be  to  minimize  the  total 
number  of  parts,  either  by  clever  design,  by  combining  several  parts  into 
one,  or  by  assigning  several  functions  to  one  part.  Until  recently,  one 
could  demonstrate  a rather  accurate  inverse  mathematical  relationship  between 
the number of discrete "active" electronic elements in a design and the
inherent reliability of the resultant piece of hardware (figure 11). In recent
years, more widespread use of integrated semi-conductor circuits (figure 12)
has brought about such improvement in the reliability of electronic components
that this relationship is no longer strictly applicable, but the fundamental
principle of low parts count is.

Curiously,  as  integrated  circuits  have  reduced  the  physical  space 
required  to  package  electronics,  more  space  has  been  created  to  pack  in 
additional  electronics.  Thus,  the  battle  to  minimize  the  parts  count  is 
a never-ending  one  for  the  designer  and  the  reliability  engineer. 

Develop A Good Model


Developing a good reliability model actually begins in the conceptual
phase. The first step is to completely define the system in terms of its
various subsystems and items of equipment. This is essentially constructing
a work breakdown structure of the hardware-related items (sometimes called
a "system tree"). Figure 13a is a simplified example of such a system tree.

The second step is to construct a functional block diagram which indicates
the functional relationship of all items in the system tree and the sequence
in which they must perform for the system to operate successfully. This block
diagram becomes very complicated, but it is constructed of combinations and
modifications of just two basic model building blocks: the series block and
the parallel block. In the series block, the failure of any one element
causes a block failure. The reliability of the block is equal to the product
of the individual element reliabilities (figure 13b). In a simple parallel
block (simple redundancy), the failure of any one element does not affect the
function which the block performs. Reliability of the parallel block, $R_s$, is
found by repeated application of the following equation:

$$ R_s = (R_s \text{ if } X_1 \text{ works})(\text{Probability } X_1 \text{ works}) + (R_s \text{ if } X_1 \text{ fails})(\text{Probability } X_1 \text{ fails}) $$

Figure 13c illustrates the application of this equation for a two element
parallel building block. Building on these basic series and parallel blocks,
one then develops a mathematical equation which expresses overall system
reliability in terms of the reliability levels of subsystems and pieces of equipment.
This is the reliability model. (MIL-HDBK 217B has a detailed discussion of modeling.)
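
The two building blocks can be sketched in Python as follows. The parallel
block applies the conditional equation repeatedly: if the first element works
the block works, and if it fails the block is as reliable as the remaining
elements. The function names and the small example system are hypothetical,
and the elements are assumed to fail independently.

```python
from math import prod


def series_block(reliabilities):
    """Series block: every element must work, so multiply the reliabilities."""
    return prod(reliabilities)


def parallel_block(reliabilities):
    """Simple parallel block: works if at least one element works.

    Repeated application of
    R_s = (R_s if X1 works)(P X1 works) + (R_s if X1 fails)(P X1 fails).
    """
    if not reliabilities:
        return 0.0  # no elements left, so the block cannot function
    first, rest = reliabilities[0], reliabilities[1:]
    return 1.0 * first + parallel_block(rest) * (1.0 - first)


# Hypothetical system: one element (R = .95) in series with a redundant pair (R = .9 each).
redundant_pair = parallel_block([0.9, 0.9])
system = series_block([0.95, redundant_pair])
print(f"redundant pair: {redundant_pair:.4f}")
print(f"system:         {system:.4f}")
```

For independent elements the recursion gives the same value as one minus the
product of the individual unreliabilities.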


How reliable must each subsystem and piece of equipment be in order to
provide a desired overall system reliability? The first cut at answering this
question occurs during the conceptual phase. Starting at the top of the
system tree, reliability levels are allocated or apportioned among the various
subsystems.¹ Assumptions are made about the degree of reliability one can
realistically expect, given the state of the art and the reliability of
similar items in current use. The allocation process is repeated at
successively lower levels in the system tree until, as a rule, every item
down to the equipment or equipment module level has been allocated a
reliability goal or "budget".


During the development phase, design engineers begin selecting detailed
parts and applying them in specific circuit designs. Reliability engineers
assess the suitability of the design by calculating reliability predictions.
The predictions are based on established or assumed failure rates for each
component part and estimated part stresses such as voltage, power,
temperature, etc. The reliability predictions build from the bottom of the
system tree upward until an estimate for the system is predicted. At all levels,
predictions are compared with previous allocations. Differences are resolved
by redesign or re-allocation of the reliability budget. This iterative process
is repeated throughout the design phase to ensure that reliability is "designed
in."
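
The comparison of predictions with allocations can be sketched as below. All
item names and numbers are hypothetical, the budgets are expressed as failure
rates, and the parts of each item are assumed to be in series with constant
failure rates so that their rates simply add.

```python
# Hypothetical failure-rate budgets (failures per hour) allocated from the top down.
allocated = {"receiver": 4.0e-4, "transmitter": 3.0e-4, "power supply": 1.0e-4}

# Hypothetical part failure rates (failures per hour) predicted from the bottom up.
predicted_parts = {
    "receiver": [1.0e-4, 1.5e-4, 0.5e-4],
    "transmitter": [2.0e-4, 1.5e-4],
    "power supply": [0.4e-4, 0.3e-4],
}


def compare(allocations, parts):
    """Return {item: (predicted rate, budget, within_budget)} for series parts."""
    report = {}
    for item, budget in allocations.items():
        predicted = sum(parts[item])
        report[item] = (predicted, budget, predicted <= budget)
    return report


report = compare(allocated, predicted_parts)
for item, (predicted, budget, ok) in report.items():
    status = "within budget" if ok else "over budget: redesign or re-allocate"
    print(f"{item:13s} predicted {predicted:.1e}, budget {budget:.1e} -> {status}")

system_rate = sum(predicted for predicted, _, _ in report.values())
print(f"predicted system MTBF: {1.0 / system_rate:.0f} hours")
```

An item flagged as over budget is the kind of difference the text says is
resolved by redesign or by re-allocating the reliability budget.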


¹The prudent designer usually starts with a design reliability goal
which is at least 125% of the requirement, expressed in terms of MTBF.
This provides an overall "safety factor."

[Fig. 14: Reliability of Electronic Components]


Select Reliable Parts


Parts vary greatly in their reliability. A 100 ohm resistor used in a
portable television may have a tolerance of ±10%, a failure rate of 1 per ten
thousand hours, and a cost of 8¢, while a 100 ohm resistor in a strategic
missile probably has a tolerance of less than ±1%, a failure rate of less than
1 per million hours, and a cost in excess of $1. Choosing reliable electronic
components depends not only on the required tolerances and basic failure rates,
but also on the degree to which infant mortalities must be eliminated from
the population.

Electronic  components  are  generally  classified  into  three  reliability 
categories  (figure  14). 

1. Commercial or Industrial. These are generally good quality parts
which  any  vendor  can  design  and  manufacture  to  whatever  reliability  level  is 
dictated  by  his  market.  These  parts  are  typically  used  in  such  applications 
as  television,  hi-fi,  radio,  expensive  consumer  goods  and  some  military  ground 
support  equipment. 

2. Military Standard. These are higher grade parts available only from
qualified sources who have manufactured and tested them according to strict
military quality standards. They are roughly 5-10 times more reliable than
commercial parts and are used in such items as tactical missiles, communication
equipment, and vehicles.

3. High Reliability. "HIGH REL" components are the highest grade--
roughly 5-10 times more reliable than MIL-STD parts. In addition to
undergoing inspections after almost every step of the manufacturing cycle, these
parts are subjected to an array of very stressing environmental tests. The
objective is to screen out all units with latent quality defects--the infant
mortalities. Applications such as aircraft avionics, satellites, strategic
missiles, and "wooden round" tactical missiles generally require HIGH REL
components.

If  reliability  is  to  be  designed  into  a system,  the  reliability  of  the 
individual  components  must  be  known  or  at  least  estimated.  Extensive  testing 
of  MIL-STD  and  HIGH  REL  parts  has  led  to  the  development  of  standardized  tables 
for  base  failure  rates  under  varying  conditions  of  temperature  and  voltage 
stress.  By  referring  to  these  tables  (or  other  lists  of  preferred  parts), 
a designer  can  choose  a part  which  has  a proven  failure  rate  consistent  with 
the  apportioned  reliability  goals  for  the  article  under  design.  Usually 
he  cannot  afford  the  luxury  of  calling  out  all  HIGH  REL  parts.  Due  to  the 
rigorous  control  under  which  they  are  manufactured  and  the  relatively  low 
percentage  of  parts  which  pass  subsequent  screening  tests,  the  cost  of  HIGH 
REL parts is 2-3 times that of MIL-STD parts and 5-10 times that of commercial
equivalents. A further limitation is that there are only a few qualified
suppliers and their output is limited.
[Fig. 16: Reliability of an item with N redundant components, each with R = .5 or .8. Axis labels recovered: Reliability of Item; Percent of Rated Load.]

Apply Parts Properly


After  selecting  reliable  parts  with  (preferably)  known  failure  rates,  one 
must  insure  that  their  inherent  reliability  is  not  degraded  by  interactions 
within  the  design,  such  as  excessive  surges  of  electrical  current  or  damaging 
heat  generated  from  surrounding  components.  A number  of  analysis  techniques 
are  used  to  pinpoint  potential  problems.  Several  of  them  are: 


• Worst  Case  Analysis,  which  evaluates  design  performance  under  all 
possible  extremes  of  electrical  and  physical  environment. 


• Tolerance  Analysis,  which  evaluates  the  build-up  effect  of  individual 
part  tolerances,  each  of  which  may  be  allowable,  but  the  sum  of  which 
may  cause  unacceptable  conditions. 


• Failure  Modes  and  Effects  Analysis  (FMEA),  which  predicts  the  most 
likely  cause  of  failure  for  each  part  and  then  evaluates  the  impact 
of  that  failure  on  the  remaining  system.  This  produces  a clear 
picture  of  likely  failure  patterns  and  critical  parts. 
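
The build-up effect described under Tolerance Analysis can be sketched with a few lines of arithmetic. The example below is only an illustration: the three resistor values, their tolerances, and the allowable deviation for the series total are assumed numbers, not values from this booklet, and the function names are hypothetical.

```python
def stack_up(parts):
    """Return (lowest, nominal, highest) total for parts connected in series.

    parts is a list of (nominal_value, tolerance_fraction) pairs.
    """
    nominal = sum(value for value, _ in parts)
    lowest = sum(value * (1.0 - tol) for value, tol in parts)
    highest = sum(value * (1.0 + tol) for value, tol in parts)
    return lowest, nominal, highest


def within_limits(parts, allowed_fraction):
    """True if the worst-case total stays within +/- allowed_fraction of nominal."""
    lowest, nominal, highest = stack_up(parts)
    floor = nominal * (1.0 - allowed_fraction)
    ceiling = nominal * (1.0 + allowed_fraction)
    return floor <= lowest and highest <= ceiling


# Assumed example: three resistors in series, each within its own tolerance.
resistors = [(100.0, 0.10), (220.0, 0.10), (470.0, 0.05)]
lowest, nominal, highest = stack_up(resistors)

# Each part is individually acceptable, yet the sum can drift about 7% from
# nominal, so a circuit that tolerates only 6% fails the worst-case check.
passes_6_percent = within_limits(resistors, 0.06)
passes_8_percent = within_limits(resistors, 0.08)
```

A simple sum of extremes like this is the most pessimistic form of the analysis; it assumes every part sits at its limit in the same direction at the same time.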


Extensive  design  analysis  will  indicate  the  need  to  select  different 
parts,  or  employ  other  design  techniques  to  improve  reliability.  Three  of 
these  techniques  are  particularly  important:  derating,  redundancy,  and  local 
environmental  protection. 

• Derating. Derating is simply applying a safety factor. For a mechanical
part, it means choosing or designing the part to bear a larger mechanical
load than the part is expected to encounter. For electronic components, it
means limiting the use of the component to electrical loads which are less than
those for which the part is designed or rated--thus, "derating." The degree
of derating depends on factors such as operating temperature, power consumption,
and other indices of stress. A sample derating chart for resistors is
shown in figure 15.

• Local  Environmental  Protection.  Local  environmental  conditions 
which  are  too  severe  to  correct  by  relocation  or  derating  may  require 
special design features such as: fins to conduct heat away, seals to exclude
humidity, or stiffeners to dampen vibration.

• Redundancy. Redundancy can be an effective way to improve the reliability
of a critical part. Figure 13c showed how double redundancy
increased  the  reliability  of  a component  from  .8  to  .96.  The  solid  curve 
of  figure  16  demonstrates  the  further  improvement  possible  by  adding  more 
redundant  components. 


The dotted curve of figure 16 illustrates one of the limitations of redundancy:
if you don't start with a fairly reliable part, it takes a lot of redundancy
to reach the .99 level. Other limitations which prevent the use of redundancy
as a panacea include: (1) parts count goes up with corresponding increases
in heat, cost, and the number of individual part failures which must eventually
be repaired; (2) redundant elements sometimes introduce additional failure
modes; and (3) more sophisticated maintenance and test equipment and test
circuitry are required to discern partial failures of a redundant element.
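
The redundancy figures quoted above can be reproduced with a short calculation. This is a minimal sketch which assumes that the redundant components fail independently and that the item works as long as at least one component works; the function names are hypothetical.

```python
def parallel_reliability(r, n):
    """Reliability of an item with n redundant components, each of reliability r.

    Assumes independent failures; the item fails only if all n components fail.
    """
    if not 0.0 <= r <= 1.0 or n < 1:
        raise ValueError("r must be in [0, 1] and n must be at least 1")
    return 1.0 - (1.0 - r) ** n


def components_needed(r, target):
    """Smallest number of redundant components that reaches the target."""
    if not 0.0 < r < 1.0 or not 0.0 < target < 1.0:
        raise ValueError("r and target must be strictly between 0 and 1")
    n = 1
    while parallel_reliability(r, n) < target:
        n += 1
    return n


double_redundancy = parallel_reliability(0.8, 2)   # the .8 to .96 case
needed_for_good_part = components_needed(0.8, 0.99)
needed_for_poor_part = components_needed(0.5, 0.99)
```

Under these assumptions a part with R = .8 reaches the .99 level with three components, while a part with R = .5 needs seven, which is the limitation the dotted curve illustrates.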
Conduct Thorough Design Reviews


Unfortunately, not all designers have the experience and attitude necessary
to systematically consider every aspect of a design at the time it is
being developed. Under the stress of time and pressure to meet performance
requirements, other important areas often are neglected or compromised excessively.
The next best thing to a design without errors is a design review
which corrects the errors before they become "cast in hardware," so to speak.

Design  reviews  provide  formalized  periodic  appraisal  of  the  design  to 
evaluate  its  progress  in  meeting  all  objectives--performance,  reliability, 
maintainability,  safety,  etc.  They  bring  specialized  talent  to  bear  on  specific 
problem  areas.  The  review  team  typically  consists  of  one  or  two  senior  design 
engineers,  several  project  engineers,  a reliability  engineer,  a maintainability 
engineer, a value engineer, and other specialists such as metallurgists and human
factors engineers, as they are required. The optimal review team size is
10-15.


Prior  to  the  review,  each  member  is  furnished  with  a data  package  and 
copies  of  applicable  analyses  to  study.  To  insure  that  all  important  design 
considerations  are  reviewed,  a comprehensive  checklist  is  invaluable.  A 
sample checklist is at Appendix A. Problems must be expected and frankly discussed
by both designer and reviewer. The reviewer should not expect a
finished, perfect product or else the designer will be forced to cover up problems
to present a rosy picture, and the review concept will be of little use.
Problems  which  cannot  be  solved  on  the  spot  are  assigned  as  action  items  to 
specific  individuals  for  resolution  by  a given  time.  The  design  review  is  not 
complete  until  all  action  items  are  resolved. 

Within  the  DOD  weapons  system  acquisition  process,  there  are  four  broad 
categories  of  design  reviews: 

• Preliminary  Design  Review 

• Interim  Design  Review 

• Critical  Design  Review 

• Production  Design  Review  (or  Final  Design  Review) 

The  approximate  timing  of  these  reviews  in  relation  to  other  design  activities 
is  shown  by  figure  17.  The  actual  number  of  design  reviews  held  by  military 
development  agencies  will  depend  on  the  number  of  critical  decision  points  in 
a given  program  and  the  philosophy  of  the  program  management  team. 

Some often cited problems with government design reviews are: (1) they
are omitted or shortened due to the pressure of time and money, (2) they are
attended  by  an  insufficient  number  of  qualified  people,  particularly  from  the 
specialty  disciplines  whose  criticism  of  the  design  later  in  the  program  is  so 
costly  and  painful,  and  (3)  follow-up  on  items  requiring  government  action 
is  inadequate  and  too  slow. 
RELIABILITY IN THE DEVELOPMENT PHASE


When  the  design  is  complete  on  paper  and  it  has  been  judged  satisfactory, 
the construction of engineering models begins. These models are used for extensive
testing to insure that the design meets all of its specified requirements--both
performance requirements and reliability requirements. Tests are
designed so that, as much as possible, one test will provide data for several
different purposes. For example, a test designed primarily to evaluate the
performance of a radio under extremes of temperature could also yield valuable
reliability data by indicating the effect of temperature on electronic module
failure modes. This is an example of integrated testing, and it is a very
important  aim  of  all  test  planning. 

Initial reliability performance is usually only a fraction of that predicted
during the design phase. The reasons are many: unforeseen circuit
interactions, unexpectedly large environmental stresses, poor quality parts,
and so on. Improvement comes through long hours of testing, thorough analysis
of all failures, and fundamental solutions to problems--in short, test, analyze,
and fix (TAAF).

Testing 

In  terms  of  basic  methods,  there  are  generally  two  types  of  reliability 
testing:  environmental  testing  and  longevity  testing. 

• Environmental testing subjects equipment to a host of environmental
extremes  such  as  temperature,  shock,  vibration,  fog,  salt  water  spray,  fungus, 
mud,  etc.  The  purpose  of  this  testing  early  in  development  is  to  assess  the 
sensitivity  of  operating  parameters  to  various  environmental  stresses  and  to 
detect  unexpected  failure  modes.  Later  in  development,  environmental  testing 
is  used  to  demonstrate  that  a major  subsystem  or  equipment  is  unaffected  by 
specified  environmental  stresses.  A typical  environmental  test  profile  (one 
cycle  only)  for  temperature,  vibration,  and  on/off  switching  is  shown  in 
figure  18. 

• Longevity Testing evaluates MTBF trends over extended periods of
operating time. The earlier described test of resistors was a form of
longevity testing for component parts. Unlike a component, a piece of
equipment is a repairable item. When a part fails, it is replaced and the
test continues. MTBF is determined by (1) operating the equipment continuously,
(2) repairing failures as they occur, (3) noting the total number of failures
during the entire test period, and then (4) dividing the total test time by
the total number of failures.¹ Figure 19 illustrates this procedure for a
complex  weapon  control  system  which  experienced  67  failures  during  a 3000 
hour  test. 


¹ This calculation of MTBF is valid only when the equipment follows an
exponential  reliability  distribution,  i.e.,  the  rate  at  which  failures  occur 
must  be  reasonably  constant.  Additionally,  in  order  for  this  procedure  to 
give  a true  indication  of  MTBF,  the  design  must  remain  fairly  stable  during 
the  test.  For  this  reason,  longevity  tests  are  not  very  meaningful  during 
the  early  "breadboard"  stages  of  development. 
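
The MTBF procedure described above reduces to a single division, shown here for the 67 failures in 3000 hours cited for the weapon control system. The second function is an added illustration: it applies the exponential (constant failure rate) assumption from the footnote to estimate the chance of completing a given operating period without failure. The function names and the 10-hour period are assumptions made for this sketch.

```python
import math


def mtbf(total_test_hours, failures):
    """Mean time between failures: total test time divided by failure count."""
    if failures <= 0:
        raise ValueError("at least one failure is needed to compute MTBF this way")
    return total_test_hours / failures


def reliability(operating_hours, mtbf_hours):
    """Probability of no failure over operating_hours, assuming a constant
    failure rate (the exponential case noted in the footnote)."""
    return math.exp(-operating_hours / mtbf_hours)


# The weapon control system example: 67 failures during a 3000 hour test.
estimate = mtbf(3000, 67)          # a little under 45 hours

# Assumed 10-hour operating period, for illustration only.
chance_of_success = reliability(10, estimate)
```

As the footnote cautions, the result is meaningful only if the design stayed reasonably stable and failures arrived at a roughly constant rate during the test.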
Failure Analysis


Testing  alone  does  not  improve  reliability.  It  merely  confirms  what  has 
been  designed  into  the  product.  Every  test  failure  must  be  recorded  along  with 
the  prevailing  test  conditions  and  painstakingly  analyzed.  First,  the  apparent 
cause of failure must be isolated. To isolate a failure, the reliability engineer
employs a range of electrical, mechanical, and chemical tests, chemical
solvents,  and  optical  techniques  as  sophisticated  as  the  scanning  electron 
microscope.  He  literally  disassembles  the  failed  item  down  to  basic  raw 
materials,  if  necessary.  Figures  20  and  21  illustrate  typical  electronic 
component  failures.  Figure  20  pictures  (75X)  the  lower  left  corner  of  an 
integrated  circuit  which  was  contaminated  with  a small  drop  of  some  chemical 
(dark  arrow).  After  power  was  applied  during  operation,  heat  caused  the 
chemical  to  spread  until  it  caused  a partial  short  circuit  (light  arrow). 

Figure  21  shows  (also  75X)  a transistor  post  from  which  the  lead  became 
separated  after  power  was  applied. 

When  the  apparent  failure  has  been  isolated,  the  analyst  must  be  sure  he 
has  found  the  root  cause  of  the  problem.  Sometimes  a part  fails  for  a reason 
entirely  unto  itself.  Other  times  a part  begins  to  deteriorate,  but  as 
it  fails,  it  induces  a failure  in  a second  part.  (Reliability  specialists 
euphemistically differentiate these types of failures as "suicides" and
"murders.")

Corrective  Action 


Once the failure mechanism is thoroughly understood, the reliability engineer
and the designer work together to provide a fundamental solution to the
problem.  The  solution  may  be  simple  or  it  may  require  partial  redesign.  If 
the  problem  is  a component  quality  problem,  the  solution  may  be  to  require  the 
vendor  to  change  his  manufacturing  process  or  institute  tighter  quality  control 
on  his  current  process;  or  the  solution  may  be  to  use  a different  vendor.  As 
a recent example, repeated test failures of a particular diode used in electronic
modules were suspected to be linked to the plastic material used to
encapsulate  or  "pot"  the  module.  The  encapsulating  material,  which  was 
injected  into  the  module  under  pressure  and  heat,  was  suspected  of  causing 
an  excessive  mechanical  load  on  the  diode.  The  problem  was  greatly  reduced 
without  redesign  or  a change  in  materials,  merely  by  adding  a soft  plastic 
sleeve  around  the  diode  to  cushion  some  of  the  load. 
[Figure: Missile reliability growth by model]


Reliability  Growth 


If  a development  program  has  a vigorous  reliability  effort  supported 
by  extensive  testing,  analyzing,  and  correcting,  the  reliability  of  the 
product will continue to improve.¹ This is illustrated in Figure 22 for
a typical tactical missile development. Testing commences in early development
with small items and progressively builds up to major items of equipment
and sub-assemblies. Restrictions of money and time sometimes force
elimination  of  some  step-by-step  testing  at  lower  levels.  However,  long 
experience  has  shown  that  solving  problems  at  lower  levels  is  much  easier 
and  less  costly  in  the  long  run  than  discovering  problems  during  major 
equipment  or  sub-assembly  level  testing. 

Reliability  testing  during  the  development  phase  usually  culminates 
in  a formal  reliability  demonstration  test.  The  sole  purpose  of  the 
reliability demonstration test is to determine, before award of a production
contract, whether or not the hardware meets the specified minimum reliability
requirement. Ideally, this test employs hardware which has been
built  using  production  tooling,  test  equipment,  processes,  and  personnel. 

In  practice,  a formal  reliability  demonstration  test  is  sometimes  omitted 
if  previous  testing  has  sufficiently  demonstrated  reliability  and  if  the 
tooling  and  test  equipment  used  during  final  development  are  judged  to  be 
sufficiently  similar  to  the  production  items.  However,  there  are  obvious 
risks  associated  with  this  approach. 

Qualification  of  Parts 


Concurrent  with  the  development  of  a piece  of  hardware,  a contractor 
develops  a list  of  vendors  who  have  demonstrated  that  they  can  provide 
piece parts which conform to all specifications--including reliability
specifications.  This  is  usually  referred  to  as  vendor  qualification.  To 
become  qualified,  a vendor  usually  must  subject  his  parts  to  an  extensive 
test  program  which  includes  both  environmental  testing  and  longevity 
testing.  For  some  parts,  particularly  high-use  electronic  components, 
one or more vendors will be qualified already.² For other, non-standard
parts,  a contractor  must  develop  and  monitor  a qualification  test  program 
by  which  the  vendor  demonstrates  the  conformance  of  his  product.  If  a 
contractor  decides  to  make  a part  in-house,  he  too  must  subject  his  part 
to  a qualification  test  program. 

Generally,  every  reasonable  effort  is  made  to  have  more  than  one 
source  for  each  part.  Many  programs  have  suffered  substantial  delays  and 
cost penalties when the sole qualified source for a critical part experienced
difficulty. The process of qualifying a new supplier is both
lengthy (6-12 months) and expensive ($10,000-$100,000).


¹ Selby and Miller (reference 23) observed that for a fixed level of
reliability  engineering  effort,  the  improvement  in  reliability,  as 
measured  by  MTBF,  was  proportional  to  the  square  root  of  the  total 
cumulative  test  time. 

² Formal government lists of qualified vendors exist for some products
by type and are referred to as "Qualified Products Lists" or QPL.
A Major Problem: Demonstrated vs. Field Reliability


The  ultimate  objective  of  a reliability  program  is  to  develop,  produce, 
and  deploy  a piece  of  hardware  which  meets  a certain  level  of  reliability 
under  field  conditions.  The  reliability  demonstration  test  at  the  end  of 
the  development  phase  is  intended  to  confirm  that  an  acceptable  level  of 
reliability  has  been  reached.  Yet,  the  evidence  suggests  that  demonstration 
testing does not adequately fulfill its intended function. Figure 23
illustrates  the  extent  to  which  some  typical  system  reliability  levels 
under  field  conditions  fall  short  of  levels  demonstrated  at  the  end  of 
development.  It  is  common  to  find  demonstrated  MTBF  to  field  MTBF  ratios  of 
5 or  10  to  1.  Why?  What  is  wrong  with  the  system?  There  are  many  reasons, 
but two major causes stand out: (1) failure to test to actual field environment,
and (2) lack of uniformity in the definition of failures.

• Test Environments vs. Field Environments. Despite the designer's best
efforts to incorporate in the design all aspects of the actual field environment,
many details are overlooked or simply cannot be anticipated. Unless
subsequent tests duplicate field environments, design shortcomings remain
undetected throughout development. Unfortunately, the military specifications
and  standards  which  prescribe  test  conditions  do  not  currently  provide  an 
"automatic"  test  of  all  severe  operational  environments.  The  current 
standards  were  developed  with  heavy  emphasis  on  standardization  of  test 
levels  in  order  to  economize  on  purchase  of  environmental  test  equipment. 

These  standard  test  levels  overtest  in  some  areas  and  undertest  in 
others.¹ For example, Figure 24 shows the vibration levels experienced by
an  aircraft  forward-looking  radar  during  demonstration  testing.  The  upper 
curve  shows  the  actual  vibration  levels  experienced  in  field  operation.  The 
tremendous  difference  is  due  to  vibration  caused  by  firing  of  the  plane's 
guns--a  factor  which  certainly  should  have  been  tested  during  development. 

Help  is  on  the  way  in  this  area.  The  test  standards  are  currently  being 
revised to improve tailoring of test conditions to equipment end use, e.g.,
airborne,  missile,  ground  fixed  or  mobile,  and  shipboard.  This  will  help 
the  developer  to  systematically  require  testing  which  matches  the  most 
appropriate  and  most  severe  mission  profile. 

An  equally  serious  shortcoming  of  development  testing  is  the  failure  to 
adequately consider systematic failure modes caused by maintenance techniques.
Over the years, gains in technology have been aimed primarily at
increasing performance, with inadequate emphasis on designing products for
ease of trouble-shooting and maintenance. As equipment grows increasingly
complex, maintenance personnel under pressure to improve "operational readiness"
resort to cannibalization and other "quick and dirty" maintenance
techniques. The results are maintenance-induced faults, a large percentage
of equipment removed which is later found to be without defects, and a
reduction  in  field  reliability.  Improvement  in  this  area  can  come  only 
through  increased  recognition  of  the  human  environment,  both  during  design 
and  during  testing.  If  an  operator's  judgement  is  the  failure  criterion  in 
the  field,  then  an  operator  should  be  included  in  demonstration  testing. 


¹ Reference 27, p. 32.


• Test Failures vs. Field Failures (Relevant/Non-relevant). There
is generally very little disagreement on the results of performance testing.
If,  for  example,  a voltage  output  of  12  volts  is  required,  there  is  little 
argument  over  whether  or  not  it  is  achieved,  because  the  definition  of  a 
"volt"  is  not  debatable.  With  reliability  demonstration  testing,  however, 
there  is  usually  considerable  disagreement  over  the  number  of  failures 
experienced.  The  reason  is  that  not  all  failures  are  counted  as  failures 
in  the  computation  of  MTBF.  Failures  which  are  caused  by  "a  condition 
external  to  the  equipment  under  test  which  is  not  a test  requirement  and 
not encountered in service,"¹ can be termed "non-relevant" and discounted.
Non-relevant  failures  can  stem  from  a variety  of  causes  such  as: 

(1)  Failures  directly  attributable  to  improper  equipment 
installation  in  the  test  chamber. 

(2)  Failures  of  test  instrumentation  or  monitoring 
equipment  (other  than  built-in  test  equipment). 

(3)  Failures  resulting  from  test  operator  error  or  test 
procedure  error  in  setting  up  or  testing  the 
equipment  (e.g.,  dropping  test  item). 

(4) Failures clearly attributable to an overstress condition
in excess of the design requirements (often
user-induced, e.g., improper operation or maintenance
in an operational test).

These exceptions are equitable and probably necessary, but with such
great latitude for interpretation, the final value of demonstrated reliability
is usually set after considerable negotiation between the government
developer and the contractor, both of whom are naturally interested
in  getting  on  with  production.  The  result  is  a compromise  which  reclassifies 
many  of  the  failures  as  non-relevant. 

Of  course  the  field  environment  has  its  own  definition  of  a failure:  a 
failure  is  a failure  is  a failure!  In  the  field  all  failures  are  relevant, 
require  maintenance  effort,  and  reduce  reliability.  Figure  25  illustrates 
the  results  of  a 1971  study  on  operational  avionics  equipment  failures. 

Almost  half  of  the  failures  were  attributable  to  "other"  causes,  which 
would  normally  be  considered  non-relevant  during  demonstration  testing. 
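
The effect of reclassifying failures on the computed MTBF is easy to see with arithmetic. The sketch below uses invented numbers (a 2000 hour test with 20 failures, half of them judged non-relevant) purely for illustration; the function name is hypothetical.

```python
def mtbf_with_and_without_discounts(test_hours, total_failures, non_relevant):
    """Return (demonstrated MTBF, MTBF counting every failure, their ratio).

    demonstrated MTBF counts only the failures left after non-relevant ones
    are discounted; the second figure treats every failure as relevant, which
    is how the field experiences them.
    """
    relevant = total_failures - non_relevant
    if relevant <= 0 or total_failures <= 0:
        raise ValueError("need at least one relevant failure")
    demonstrated = test_hours / relevant
    all_counted = test_hours / total_failures
    return demonstrated, all_counted, demonstrated / all_counted


# Assumed example: half of the failures are negotiated away as non-relevant.
demonstrated, all_counted, ratio = mtbf_with_and_without_discounts(2000, 20, 10)
```

In this assumed case, discounting half of the failures doubles the demonstrated MTBF. That accounts for only part of the 5 or 10 to 1 ratios mentioned earlier; differences between test and field environments would have to explain the rest.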

There is no easy solution to this problem, but several actions can
help:

(1)  Setting  reliability  requirements  which  are  reasonable 
and  will  not  force  the  contractor  to  rely  heavily  on 
"testmanship." 

(2)  Insuring  that  contractor  proposed  corrective  actions 
will  actually  correct  the  problem  and  not  induce  other 
failure  mechanisms. 

(3)  Duplicating  to  a far  greater  extent  during  testing,  the 
field  physical  and  human  environment,  including  data 
collection  and  analysis  procedures. 


¹ Reference 22, para 5.5.1.(1)
RELIABILITY IN THE PRODUCTION PHASE


The  principal  objective  of  a reliability  effort  during  the  production 
phase  is  to  insure  that  the  reliability  inherent  in  the  design  at  the  end 
of  development  is  not  degraded  during  the  manufacturing  process.  This  is 
accomplished  primarily  by  assuring  that  incoming  purchased  parts  and 
materials,  manufacturing  processes,  and  inspection  procedures  all  conform 
to  strict  standards  which  allow  no  more  than  a very  small  percentage  of 
defective  items  to  pass  any  stage  in  the  manufacturing  cycle.  This  assurance 
effort  actually  comes  under  the  heading  of  quality  assurance,  which  involves 
assuring  the  quality  of  not  only  reliability  but  of  all  details  contained 
in the product specifications.¹

For this reason, many manufacturers administer their reliability
programs  during  production  as  part  of  the  quality  assurance  program. 

However, there are several activities which are distinctly oriented
toward reliability and which often support the existence of a reliability
organization separate from the quality organization, especially
in DOD programs. Two of the most important of these activities are (1)
insuring the continued high reliability of incoming parts and materials,
and (2) conducting a reliability demonstration test on the finished items of
hardware. 

Reliability  of  Incoming  Parts 

The reliability level of purchased parts will normally have been established
prior to production by some sort of vendor qualification. But
insuring  continued  high  reliability  is  a never-ending  vigil.  Vendors 
habitually  make  some  small  change  in  their  process  or  materials  which  affects 
a part's  reliability--without  informing  the  manufacturer.  The  slightly 
changed  part  usually  still  conforms  to  the  drawing;  therefore  the  change  is 
undetected  during  incoming  quality  control  inspections.  Unless  the  change 
affects  performance,  it  may  remain  undetected  for  some  time,  and  the  longer 
it  takes,  the  more  costly  will  be  the  repair  and  rework. 

This problem is particularly acute in the case of electronic components.
To protect against this, many manufacturers subject electronic
components to an environmental screening process which screens out latent
defectives. The mainstay of the screening process is a burn-in, i.e.,
operating the device at an elevated temperature for several hundred hours.
Temperature accelerates aging of electronic devices. Therefore burn-in
effectively operates devices through most of their infant mortality
period and weeds out many latent defectives. Burn-in at the component
level is not 100% effective, but by repeating the burn-in at the next higher
manufacturing level (when components are attached to printed circuit boards),
the number of defectives can usually be diminished to an acceptably low
level.


¹ Figure 26 shows how the quality control inspections are used during
the  manufacture  of  high  quality  transistors. 
Production Reliability Demonstration Tests

Reliability  demonstration  testing  during  production  confirms  that 
reliability  has  not  suffered  during  the  manufacturing  process,  and  that 
the hardware is ready for the field. There are basically two kinds of
production  reliability  tests:  (1)  an  extended  MTBF  test  performed  on  a 
small  sample  from  each  production  lot,  and  (2)  a shorter  screening  test 
performed  on  all  of  the  items  in  each  production  lot. 

• Extended MTBF Test. This test is conducted on a small sample randomly
selected from a production lot. The test articles are operated
continuously  while  being  subjected  to  environmental  extremes;  failures  are 
repaired  as  they  occur.  In  Figure  27  the  "stairstep"  plots  of  cumulative 
sample  test  times  and  failures  illustrate  three  possible  outcomes  for 
a typical  test  of  an  item  which  has  an  MTBF  requirement  of  200  hours: 

(A)  The  test  was  a failure  and  the  lot  rejected  because  sample 
failures  occurred  at  too  high  a rate.  The  eighth  failure 
occurred  after  approximately  200  hours  of  accrued  test 
time  and  forced  the  cumulative  plot  across  the  reject 
decision  boundary.  This  indicated  an  unacceptably  large 
risk  that  many  of  the  items  in  the  lot  would  have  an  MTBF 
below  the  200  hour  requirement. 

(B)  The  test  was  a success  and  the  lot  accepted  because  sample 
failures  occurred  at  a sufficiently  low  rate.  Only  6 
failures  had  occurred  when  the  cumulative  plot  crossed  the 
accept  decision  boundary  after  about  1230  hours  of  accrued 
test  time.  This  indicated  only  a small  risk  that  many  items 
in  the  lot  would  have  an  MTBF  below  the  200  hour  requirement. 

(C)  The  test  was  terminated  with  inconclusive  results.  After 
2100  total  test  hours  and  14  failures,  the  MTBF  of  the 
sample  was  neither  good  enough  nor  bad  enough  to  reach  an 
accept/reject  decision.  The  lot  was  conditionally  accepted, 
pending  contractor  correction  of  defects  indicated  by  the 
sample  testing. 

Obviously, there are many variables in an MTBF test: sample size, level
of risk, accept/reject thresholds, etc. Sample test plans for a wide
range of situations are given in MIL-STD-781.
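
The accept, reject, and continue outcomes described above can be illustrated with a generic sequential probability ratio test (Wald's method) for equipment with a constant failure rate. This is a sketch of the general technique only. It does not reproduce the tabulated plans of MIL-STD-781 or the decision boundaries of Figure 27, and the parameter values below (an acceptable MTBF of 200 hours, an unacceptable MTBF of 100 hours, and 20% risks for both parties) are assumptions chosen for illustration.

```python
import math


def sequential_decision(test_hours, failures, mtbf_good, mtbf_bad,
                        producer_risk=0.2, consumer_risk=0.2):
    """Return "accept", "reject", or "continue" for the data so far.

    mtbf_good: MTBF at which a lot should nearly always be accepted.
    mtbf_bad:  lower MTBF at which a lot should nearly always be rejected.
    producer_risk: chance of rejecting a lot whose MTBF equals mtbf_good.
    consumer_risk: chance of accepting a lot whose MTBF equals mtbf_bad.
    """
    if not 0 < mtbf_bad < mtbf_good:
        raise ValueError("need 0 < mtbf_bad < mtbf_good")
    # Log of the likelihood ratio: "bad lot" hypothesis over "good lot"
    # hypothesis, for failures arriving at a constant rate.
    log_ratio = (failures * math.log(mtbf_good / mtbf_bad)
                 - test_hours * (1.0 / mtbf_bad - 1.0 / mtbf_good))
    reject_at = math.log((1.0 - consumer_risk) / producer_risk)
    accept_at = math.log(consumer_risk / (1.0 - producer_risk))
    if log_ratio >= reject_at:
        return "reject"
    if log_ratio <= accept_at:
        return "accept"
    return "continue"


# The three data points quoted in the text, evaluated under the assumed plan.
outcome_a = sequential_decision(200, 8, mtbf_good=200, mtbf_bad=100)
outcome_b = sequential_decision(1230, 6, mtbf_good=200, mtbf_bad=100)
outcome_c = sequential_decision(2100, 14, mtbf_good=200, mtbf_bad=100)
```

With these assumed values, the three points fall on the reject, accept, and continue sides respectively. A different choice of risks or of the ratio between the two MTBF values moves the boundaries, which is why the plan must be agreed before testing begins.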

MTBF  sample  testing  is  very  useful,  but  it  has  features  which  can  be 
undesirable.  First,  the  test  lasts  a number  of  weeks,  during  which  the 
remainder of the production lot is either held in "bond" pending the outcome
of the test or it is processed onward in normal fashion. (In this
latter case, by the time a reject decision is reached, substantial quantities
of hardware could already be fielded.) Second, if the sample passes
the test, there is still a risk that some items in the lot will have MTBF's
substantially below the requirement. These "lemons" could have a detrimental
effect on field operations. Of course, every item in the lot
could  be  subjected  to  an  MTBF  test,  but  the  cost  of  this  approach  is 
usually  prohibitive.  A compromise  is  offered  by  the  second  kind  of 
production  reliability  test,  the  "all  equipment  screening  test." 
• All Equipment Screening Tests (sometimes referred to as "burn-in")


This  approach  subjects  every  item  in  a lot  to  a minimum  amount  of 
operating  time  under  stressing  environmental  conditions.  All  failures  are 
analyzed  and  repaired,  and  every  item  must  have  a certain  period  of  failure- 
free  operation  in  order  to  pass.  A screening  test  acts  as  a "shake-down" 
to  weed  out  defects  not  visible  in  normal  quality  control  performance 
testing.  It  is  similar  to  the  screening  performed  on  components  to  weed 
out  infant  mortalities. 
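
The effect of a failure-free requirement can be illustrated with a short calculation. The sketch below assumes exponentially distributed times to failure (a constant failure rate), which is a simplification that does not model infant mortality itself; the 50-hour failure-free period and the MTBF values are hypothetical.

```python
import math

def prob_pass_screen(failure_free_hours, mtbf_hours):
    """Chance that one item runs failure-free for the required period
    on a single attempt, assuming exponential times to failure."""
    return math.exp(-failure_free_hours / mtbf_hours)

# Hypothetical 50-hour failure-free requirement applied to items
# whose true MTBF is poor, marginal, or good.
for mtbf in (10.0, 50.0, 200.0):
    print(mtbf, round(prob_pass_screen(50.0, mtbf), 3))
```

Under these assumptions, an item with a very low MTBF rarely completes the failure-free period on its first attempt, so it is likely to be caught, analyzed, and repaired before shipment.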

Screening  tests  do  have  some  shortcomings.  Since  the  test  time  per 
item  is  much  less  than  in  extended  MTBF  tests,  screening  tests  do  not  yield 
high-confidence  estimates  of  MTBF.  Additionally,  screening  tests  can  be 
more  expensive  because  of  the  requirement  for  a  large  investment  in  test 
equipment.  However,  in  many  cases  these  disadvantages  are  outweighed 
by  the  very  beneficial  effect  of  subjecting  100%  of  the  items  to  some  kind 
of  reliability  testing. 
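
The limited statistical value of a short test can be seen in a rough calculation. Assuming exponentially distributed times to failure, a test of T hours with no failures supports a one-sided lower confidence bound on MTBF of T / ln(1 / (1 - C)) at confidence level C. The test durations and the 90% confidence level below are hypothetical.

```python
import math

def mtbf_lower_bound_zero_failures(test_hours, confidence):
    """One-sided lower confidence bound on MTBF after a failure-free
    test, assuming exponential times to failure.

    Derived from P(no failure in T hours) = exp(-T / MTBF): the bound
    is the MTBF at which that probability equals 1 - confidence.
    """
    return test_hours / -math.log(1.0 - confidence)

short_screen = mtbf_lower_bound_zero_failures(50.0, 0.90)     # screening-length test
long_test = mtbf_lower_bound_zero_failures(2000.0, 0.90)      # extended test
print(round(short_screen, 1), round(long_test, 1))
```

Under these assumptions a failure-free screen demonstrates an MTBF of less than half its own duration at 90% confidence, which is why screening is better viewed as a defect filter than as an MTBF measurement.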


RELIABILITY  IN  THE  DEPLOYMENT  PHASE 


A  reliability  program  does  not  stop  when  the  product  rolls  off  the 
production  line.  Field  use  invariably  uncovers  reliability  problems  which 
escape  detection  during  even  the  best  development  and  production  testing. 
The  problem  may  be  a  latent  design  deficiency  or  (more  likely)  an 
unanticipated  failure  mode  which  appears  because  of  "green"  operating  and 
maintenance  personnel.  Some  improvement  in  field  reliability  is  usually 
possible  through  minor  design  modifications  or  changes  in  operating  and 
maintenance  procedures. 

The  military  departments  have  active  reliability  improvement  programs 
which  emphasize  collection  and  analysis  of  field  data,  identification  of 
specific  problems,  and  dedicated  funding  to  engineer  improvements.  In 
some  cases  significant  improvements  have  been  made.  For  example,  the  Army 
increased  the  MTBF  of  its  Vulcan  Air  Defense  System  from  30  hours  to  100 
hours,  which  will  yield  an  estimated  10-year  savings  of  $51  million. 
However,  improvements  such  as  this  should  not  overshadow  what  is  perhaps  the 
fundamental  principle  of  reliability:  reliability  is  designed  in! 
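
As a rough illustration of what the Vulcan MTBF figures mean in practice, the expected number of failures over a period is approximately the operating hours divided by the MTBF. The 1000-hour operating period below is a hypothetical figure chosen only for the arithmetic; the calculation does not attempt to reproduce the cost savings quoted above.

```python
def expected_failures(operating_hours, mtbf_hours):
    """Approximate number of failures expected over a period of use."""
    return operating_hours / mtbf_hours

hours = 1000.0                                # hypothetical operating period
before = expected_failures(hours, 30.0)       # MTBF before improvement
after = expected_failures(hours, 100.0)       # MTBF after improvement
reduction = 1.0 - after / before              # fractional drop in failures
print(round(before, 1), round(after, 1), round(reduction, 2))
```

Whatever operating period is assumed, raising the MTBF from 30 to 100 hours cuts the expected number of failures, and the associated repair actions, by 70 percent.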

APPENDIX  A 


GENERAL  DESIGN  REVIEW  CHECKLIST  (1) 

1.  Review  all  basic  parameters  included  in  the  data  package  for 
correctness  and  completeness. 

2.  Examine  the  subject  design  or  component  to  determine  if 
provisions  for  each  functional  requirement  have  been  included  in  the 
design.  Establish  the  feasibility  of  holding  these  to  specified  variability 
in  manufacture  and  define  the  level  of  confidence  that  must  be  generated 
to  assure  that  the  variability  is  within  limits. 

3.  Note  any  capabilities,  features,  accuracies  or  specified  tests 
which  are  beyond  the  state-of-the-art  or  beyond  the  functional  capabilities 
of  the  design  facilities. 

4.  Examine  the  design  approach  to  determine  if  the  simplest 
possible  means  for  obtaining  the  required  function  has  been  developed. 

5.  Determine  if  proven  (by  test  or  similar  application  history) 
components  and  parts  have  been  used  wherever  feasible. 

6.  Check  the  stress  analysis  (including  structural)  of  each 
component. 

7.  Compare  the  resistive  strengths  (and  any  established 
allowables)  of  each  material  with  the  calculated  load  stresses  expected. 
Indicate  the  ranges  of  variability. 

(1)  Taken  from  reference  18. 

8.  Examine  the  possibility  and  effect  of  deflection  under  load 
of  each  component  or  part  on  the  performance  required.  Estimate 
the  effect  of  external  shock  and  resonant  vibrations  on  performance 
and  life  expectancy. 

9.  Determine  the  compatibility  of  materials  and  finishes  in 
expected  environments.  If  data  are  not  available,  estimate  testing 
requirements. 

10.  Consider  the  possibility  and  effects  of  predictable  wear  on  the 
maximum  allowable  tolerances,  as  related  to  the  performance  factors 
of  the  components. 

11.  Consider  the  possibility  and  the  effects  of  adverse  tolerance 
buildup  on  each  part,  including  the  effects  of  thermal  expansion,  vibration, 
and  differential  shock  excursions. 

12.  Consider  the  producibility  of  each  component  or  part  under 
the  manufacturing  conditions  in  which  it  will  be  built. 

13.  Consider  the  related  aspects  of  accessibility,  repairability, 
maintainability  (including  lubrication)  and  operability  under  field 
conditions,  given  the  variability  in  skill  and  morale  of  personnel. 

14.  Consider  the  convenience,  special  tools  and  accuracy  required 
for  operational  adjustments,  and  control  instrumentation,  from  a human 
factors  standpoint. 

15.  Consider  the  effects  of  associated  random  casualty  and 
permanent  shock  effects  on  the  performance  characteristics  of  the 
total  system. 

16.  Consider  the  compatibility  of  the  components  and  parts  with 
each  other  and  with  supporting  services  in  the  system. 

17.  Consider  the  installation  criteria  (handling,  alignment,  etc.) 
for  the  system,  component,  or  part  in  the  overall  arrangement. 

18.  Review  the  overall  evaluation,  summarize,  and  conclude, 
noting: 

a.  The  possible  design  deficiencies,  including  contract  or 
specification  deficiencies  or  conflicts. 

b.  The  probable  and  possible  modes  of  failure  and  the  effect 
of  these  on  both  the  component  and  the  overall  system. 

c.  The  tests  deemed  necessary  to  establish  data  for  final 
reliability  assurance. 

d.  Any  inspection  procedures,  either  routine  or  special, 
which  would  help  uncover  the  most  likely  manufacturing  and  assembly  errors. 

e.  The  tests  deemed  necessary  to  fully  evaluate  performance 
vs.  design,  failure  modes,  and  overload  conditions. 

f.  For  parallel  components  or  other  components  that  can 
fail  without  causing  a detectable  system  malfunction,  list  the  periodic 
inspection  procedures  that  will  monitor  these  potential  failure  points. 
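
Checklist item 7 asks the reviewer to compare material strengths with calculated load stresses and to indicate the ranges of variability. One common way to make that comparison quantitative is a stress-strength interference calculation. The sketch below assumes that strength and stress are independent and normally distributed; the function name and all numerical values are hypothetical and are not taken from the checklist's source.

```python
import math

def interference_reliability(mean_strength, sd_strength,
                             mean_stress, sd_stress):
    """Probability that strength exceeds stress, assuming both are
    independent and normally distributed (an illustrative assumption)."""
    # Margin in standard deviations of the (strength - stress) difference
    margin = (mean_strength - mean_stress) / math.sqrt(
        sd_strength ** 2 + sd_stress ** 2)
    # Standard normal cumulative probability evaluated at the margin
    reliability = 0.5 * (1.0 + math.erf(margin / math.sqrt(2.0)))
    return margin, reliability

# Hypothetical values in consistent units (for example, ksi)
margin, reliability = interference_reliability(60.0, 4.0, 45.0, 3.0)
print(round(margin, 2), round(reliability, 5))
```

The calculation shows why the ranges of variability matter: the same average strength and average stress give a lower probability of survival when either quantity scatters more widely.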